Candidate SNP identification & Genomic offset predictions
Author
Juliette Archambeau
Published
May 11, 2023
1 Introduction
Document based on Gain and François (2021) and the LEA tutorial provided during the summer school Software and Statistical Methods for Population Genetics (SSMPG 2022; Aussois, September 19-23 2022).
The latent factor mixed model (LFMM) is a multivariate mixed regression model that estimates simultaneously the effects of environmental variables (fixed effects) and unobserved confounders called latent factors (Frichot et al. 2013; Caye et al. 2019). The latent factors are computed both from the genomes and from their environment. They are not representing neutral population structure (i.e. they have less direct interpretations than in ancestry estimation methods). Instead, they can be interpreted as the best estimates of the confounding effects of neutral population structure, leading to environmental effect size estimates with minimal bias.
2 Population structure
COMMENT: this section is not used in the paper (=> can be skipped!). Indeed, the analyses in this section are only useful if we want to estimate the population structure and impute missing data with the LEA package. In our study, we use the population structure estimates (the ancestry coefficients) from Jaramillo-Correa et al. (2015), and we impute the missing data based on the most common allele in the gene pool. And in the RDA, we do not use the ancestry coefficients to account for population structure; we use PC scores.
From the LEA package manual: The function snmf of the lfmm2 package estimates admixture coefficients using sparse Non-Negative Matrix Factorization algorithms, and provides STRUCTURE-like outputs. The input file has to be in the geno format, with one row for each SNP. Each row contains 1 character for each individual: 0 means zero copy of the reference allele. 1 means one copy of the reference allele. 2 means two copies of the reference allele. 9 means missing data.
Assuming \(n\) diploid organisms genotyped at \(L\) loci, the snmf algorithm decomposes the \(n × L\) matrix of observed allele frequencies, \(P\), in a product of two probabilistic matrices:
\[P \approx QF\]
where the coefficients of \(P\) take their values in {0,1/2,1} for diploid organisms (note: ploidy can be modified in snmf). The matrix \(Q\) is similar to the \(Q\) matrix of STRUCTURE, representing ancestry coefficients for individuals originating from \(K\) source populations (Pritchard, Stephens, and Donnelly 2000), and is a \(n × K\) matrix. The matrix \(F\) contains allele frequencies at each locus for each source population and is a \(K × 3L\) matrix. While \(P\) may contain some missing values, the product matrix, \(QF\), is always a complete probabilistic matrix. \(P\) is a \(n × 3L\) matrix.
Comment: The matrix \(F\) is called the matrix \(G\) in the LEA package and in Frichot et al. (2013).
Code
proj_snmf <-snmf(here("data/LEAanalysis/AlleleCounts_GenoFormat.geno"), K=1:10, repetitions =10, # nb of repetitions for each Kentropy =TRUE,project="new")
We plot the cross-entropy criterion of all run of the project.
Code
plot(proj_snmf, cex =1.2, col ="lightblue", pch =19)
We get the cross-entropy of the 10 runs for K = 6 and we select the run with the lowest cross entropy
Code
ce <-cross.entropy(proj_snmf, K =6)best <-which.min(ce)
We display the Q-matrix.
Code
gp_colors <-c("orangered3","gold2","darkorchid3","navyblue","turquoise2","green3") # define the colors of the gene poolsbp <-barchart(proj_snmf, K =6, run= best,border =NA, space =0, col = gp_colors, xlab ="Individuals", ylab ="Ancestry proportions", main ="Ancestry matrix")axis(1, at =1:length(bp$order), labels = bp$order, las =3, cex.axis = .4)
We use the G function to extract the ancestral genotype frequency matrix, \(G\), for the 2nd run for K = 6.
Code
Gmatrix <-G(proj_snmf, K =6, run = best)Gmatrix[1:5,1:6] %>%as.data.frame() %>%kable_mydf(boldfirstcolumn = F)
V1
V2
V3
V4
V5
V6
0.43
1
1.0
1.00
1
1.00
0.49
0
0.0
0.00
0
0.00
0.08
0
0.0
0.00
0
0.00
0.08
1
0.7
0.68
1
0.33
0.37
0
0.3
0.28
0
0.50
3 Climatic data
We load the population-specific climatic information for the climatic variables of interest. To run lfmm2, individuals (genotypes in our case) have to be in rows and climatic variables in columns.
The past and future climatic data have been scaled with the parameters (mean and variance) of the past climatic data (which is done by the function generate_clim_datatsets).
Code
# Selected climatic variables# ===========================clim_var <-readRDS(here("data/ClimaticData/NamesSelectedVariables.rds"))# Past and future climatic data# =============================source(here("scripts/functions/generate_scaled_clim_datasets.R"))clim_dfs <-generate_scaled_clim_datasets(clim_var)
We attribute climatic values to each genotype, i.e. genotypes from the same populations will have the same climatic values.
We load the genomic data. The genomic has to be allele counts without missing data, with individuals (genotypes) in rows and SNPs in columns.
Code
# we load the imputed genomic data with allele counts (and without MAF)geno <-read.csv(here("data/DryadRepo/ImputedGenomicData_AlleleCounts_withoutmaf.csv")) %>%column_to_rownames("snp_ID") %>%t() %>%as_tibble()
4.2 Run lfmm2
From Caye et al. (2019): LFMMs are regression models combining fixed and latent effects as follows:
\(\mathbf{Y}\) is the response matrix, which records data for \(n\) individuals genotyped for \(p\) genetic markers. \(\mathbf{X}\) is the matrix of the environmental or primary variables. Nuisance variables such as observed confounders can be included in the \(\mathbf{X}\) matrix, which dimension is then \(n \times d\), where \(d\) represents the total number of primary and nuisance variables. The fixed effect sizes are recorded in the \(\mathbf{B}\) matrix, which has dimension \(p \times d\). The \(\mathbf{E}\) matrix represents residual errors, and it has the same dimensions as the response matrix. The matrix \(\mathbf{W}\) is a “latent matrix” of rank \(K\),defined by \(K\) latent factors. The \(K\) latent factors represent unobserved confounders which are modeled through an \(n \times K\) matrix, \(\mathbf{U}\).The matrix \(\mathbf{U}\) is obtained from a singular value decomposition (SVD) of the matrix \(\mathbf{W}\) as follows
\[\mathbf{W} = \mathbf{UV}^T \]
where \(\mathbf{V}\) is a \(p \times K\) matrix of loadings. The \(\mathbf{U}\) and \(\mathbf{V}\) matrices are unique up to arbitrary signs for the factors and loadings.
As there are 6 gene pools in maritime pine (all represented in our population sample), we run the LFMM model with K=5.
Code
mod_lfmm2 <-lfmm2(input = geno,env = clim_ref, K =5)
The function lfmm2 returns an object of class lfmm2Class that contains the \(\mathbf{U}\) and \(\mathbf{V}\) matrices.
4.3 Calibration issues
With the function lfmm2.test, we can get a vector of p-values for association between loci and climatic variables adjusted for latent factors computed by lfmm2.
The full option:
If FALSE, the function lfmm2.test computes significance values (p-values) from standard Student tests for each climatic variable.
If TRUE, the function lfmm2.test returns p-values for the full set of climatic variables (a single value at each locus) using Fisher tests.
The genomic.control option:
If TRUE (default option), the p-values are recalibrated by using genomic control after correction for confounding.
If FALSE, the p-values are not recalibrated.
We can check if the p-values are well calibrated or not with the histograms of the p-values: ideally, they should be flat with a peak close to zero. in the two graphs below, we show the distribution of the non-calibrated (left graph) and calibrated (right graph) p-values. We can see that it is important to set the genomic.control to its default value TRUE if we want the p-values to be well calibrated.
Code
par(mfrow=c(1,2))# Histogram of non-calibrated p-values# ------------------------------------lfmm2.test(object = mod_lfmm2, input = geno, env = clim_ref, full =TRUE, genomic.control =FALSE)$pvalues %>%hist(col ="orange", main="Histogram of non-calibrated p-values",xlab="p-values")# Histogram of calibrated p-values# --------------------------------lfmm2.test(object = mod_lfmm2, input = geno, env = clim_ref, full =TRUE, genomic.control =TRUE)$pvalues %>%hist(col ="orange", main="Histogram of calibrated p-values",xlab="p-values")
In the following analyses, we use the default genomic.control=TRUE and full=TRUE, so all climatic variables are used in the test.
Code
test_lfmm2 <-lfmm2.test(object = mod_lfmm2,input = geno,env = clim_ref, full =TRUE,genomic.control =TRUE)pv_lfmm2 <- test_lfmm2$pvaluesplot(-log10(pv_lfmm2 ), cex = .3, col ="blue",xlab ="Locus", ylab ="-log10(P)", main="Manhattan plot of log10 p-values")
4.4 Multiple testing and calibration issues
We want to use the FDR control algorithm to correct for multiple testing and determine which loci have significant levels of association. The False Discovery Rate (FDR) is defined as:
FDR = prob(False Discovery | Positive test) = q.
The FDR algorithm requires that the tests are correctly calibrated, i.e. that the distribution of p-values is uniform when we assume that the null hypothesis, \(H0\), is correct. That’s ok in our case, we have already checked it above with the histogram of p-values.
Which FDR level do we use?
Code
fdr_level <-0.1
To identify the candidate SNPs, we apply the chosen FDR control to the p-values, which converts them into q-values. And then we identify candidates as those with q-values below a given FDR threshold.
The candidate SNPs at the FDR level of 10% are shown with circles on the Manhattan plot below. The orange line corresponds to the Bonferroni threshold for a type I error of 10%.
Code
# applying FDR control method to obtain q-valuesqv_lfmm2 <- qvalue::qvalue(pv_lfmm2, fdr.level = fdr_level)# Manhattan plotplot(-log10(pv_lfmm2 ), cex = .3, col ="blue",xlab ="Locus", ylab ="-log10(P)", main="Manhattan plot of log10 p-values")# Show with an orange line the Bonferonni multiple testing threshold for significanceabline(h =-log10(0.1/ncol(geno)), col ="orange")# Extract the list of candidatescandidates <-which(qv_lfmm2$significant)# Show with circles the list of candidate loci at the chosen FDR levelpoints(candidates, -log10(pv_lfmm2)[candidates], cex = .9, col ="brown")
Until now, methods to estimate the genomic offset has defined it as a distance in the genetic space (i.e. distance between allele frequencies). In the R packae LEA, the genomic offset is alternatively defined as a distance in the environmental space, i.e. measures of genomic offset are linked to the geometry of the ecological niche. This new definition of the genomic offset is referred as genetic gap and is implemented in the genetic.gap function of the LEA package.
The genetic gap is based on the estimates of environmental effect sizes obtained from an LFMM. The relationship inferred in the GEA is then used to fit and predict allelic variation at all genomic loci, alleviating the need for a set of candidate loci and the choice of a significance level.
Below, we will estimate the genetic gap for:
the sets of candidate SNPs
the control set of SNPs
all SNPs (as advised by the developers of the genetic.gap function)
5.1 Loading future climatic data
We load the predicted values of the climatic variables for the years 2041-2070 that have been scaled (i.e. mean-center) with the same scaling parameters as the one used to mean-center the annual climatic values across the period 1901-1950.
Code
# Attribute climatic values for each genotypelist_clim_pred <-lapply(clim_dfs[[2]], function(clim_pred){genotypes %>%mutate(pop =str_sub(clon,1,3)) %>%left_join(clim_pred, by="pop") %>% dplyr::select(-pop,-clon,-gcm)})
5.2 GO predictions using all SNPs
We run the genetic.gap function.
Code
ggap_all_snps <-lapply(list_clim_pred, function(clim_pred){ggap <-genetic.gap(input = geno,env = clim_ref,pred.env = clim_pred,K =5)# we keep one value per populationggap[[1]] <-unique(ggap[[1]])ggap[[2]] <-unique(ggap[[2]])names(ggap)[[1]] <-"go"return(ggap)})
The outputs of the genetic.gap function are (copy-and-pasted from the LEA manual):
offset: a vector of genomic offset values computed for every sample location in new.env and pred.env. Note that the genomic offset is referred to as the genetic gap in Gain and François (2021) and the geometric GO in Gain et al. (2023).
distance: a vector of environmental distance values computed for every sample location in new.env and pred.env. The distances to an estimate of the risk of nonadaptedness that includes correction for confounding factors and analyzes multiple predictors simultaneously (modified version of RONA).
eigenvalues: eigenvalues of the covariance matrix of LFMM effect sizes. They represent the relative importance of combinations of environmental variables described in vectors when the environmental data have similar scales. To be used with scale == TRUE.
vectors: eigenvectors of the covariance matrix of LFMM effect sizes representing combinations of environmental variables sorted by importance (eigenvalues).
5.2.0.1 Relationship with Euclidean climatic distance
In the supplementary information of Gain et al. (2023), the authors say that ‘If (and only if) the eigenvalues of the covariance matrix are equal, the geometric GO is proportional to the squared Euclidean distance between environmental predictors.’ Note that a variable \(y\)is proportional to a variable \(x\) if there is a non-zero constant \(k\) such as \(y = kx\).
In this part, we look at the relationship between the squared root of the genetic gap vs the Euclidean climatic distance (or the genetic gap and the squared Euclidean climatic distance).
The Euclidean climatic distance is calculated as follows:
with \(N\) the number of selected climatic variables, \(X_{past,i}\) the mean value of the climatic variable \(i\) across the period 1901-1950 and \(X_{fut,i}\) the predicted mean value of the climatic variable \(i\) across the period 2041-2070.
# Incorporating gene pool information# -----------------------------------# In the graphs below, we want to color the populations according to the main gene pool they belong to# we load the main gene pool information for each clonegps <-readRDS(here("data/GenomicData/MainGenePoolInformation.rds"))[[1]] %>%arrange(pop)
Correlation between the squared Euclidean climatic distance and the genetic gap (offset in the outputs):
Comment: The squared Euclidean climatic distance and the genetic gap are somewhat correlated (the strength of the correlation depending on whether we include or not the outlier pops), so it means that the genetic gap \(GO\) can be expressed as follows: \(GO = a \times dist^2 + b + e\) with \(dist\) being the Euclidean climatic distance and the \(e\) the residuals (ie the noise). Note that \(GO\) and \(dist^2\) are correlated but not proportional to!
We can plot the eigenvalues with a barplot to evaluate the relative importance of the climatic variables.
Code
# the relative importance is the same for all genomic offset predictions at it is only influenced by past climates, and not future climatesbarplot(ggap_all_snps[[1]]$eigenvalues, col ="orange", xlab ="Axes", ylab ="Eigenvalues")
We see that only two dimensions of the climatic space influence the genetic gap. We can also see it with the loadings for the first two combinations of variables (vectors in the outputs of the genetic.gap function), which indicate the relative contribution of the variables to local adaptation. We see that the first two variables had increased importance compared to the last ones (the variables are sorted by importance in the vectors outputs, so we do not know which variables they are).
Code
# Function to build the plots source(here("scripts/functions/make_eucli_plot.R"))
Code
par(mfrow=c(1,2))# plot the squared root of the genetic gap vs Euclidean environmental distancelapply(names(ggap_all_snps), function(gcm){make_eucli_plot(X = list_dist_env[[gcm]],Y =sqrt(ggap_all_snps[[gcm]]$go),colors = gps$color_main_gp_pop, color_names = gps$main_gp_pop,ylab ="sqrt(genetic gap)",plot_title =paste0("All SNPs - ",gcm)) })
5.2.0.2 Maps
Code
# Function to make the genomic offset mapssource(here("scripts/functions/make_go_map.R"))# Population coordinatespop_coord <-readRDS(here(paste0("data/ClimaticData/MaritimePinePops/ClimatePopulationLocationPointEstimates_ReferencePeriods_noADJ.rds")))[[1]]$ref_means %>% dplyr::select(pop,longitude,latitude)# Find minimum and maximum values of genomic offset for the mapsgo_limits <-lapply(names(ggap_all_snps), function(gcm){ ggap_all_snps[[gcm]]$go}) %>%unlist() %>%range()# The minimum GO value is very very small, almost zero, so we fix it to zero.go_limits[[1]] <-0list_go_all_snps <-list(go =lapply(names(ggap_all_snps), function(gcm){ggap_all_snps[[gcm]]$go}) %>%setNames(names(ggap_all_snps)))# Generate the maps for each set of SNPs and each GCMgo_maps <-lapply(names(ggap_all_snps), function(gcm){make_go_map(dfcoord=pop_coord,snp_set = list_go_all_snps,gcm=gcm,ggtitle=gcm,go_limits = go_limits,point_size =3)})legend_maps <-get_legend(go_maps[[1]])go_maps <-lapply(go_maps, function(y) y +theme(legend.position ="none"))go_maps$legend_maps <- legend_mapsgo_maps <-plot_grid(plotlist=go_maps)title <-ggdraw() +draw_label("All SNPs",fontface ='bold',x =0,hjust =0 ) +theme(plot.margin =margin(0, 0, 0, 7))# merge title and plotsplot_grid( title, go_maps,ncol =1,rel_heights =c(0.1, 1))
5.3 GO predictions using sets of SNPs
We estimate the genetic gap for three sets of SNPs:
Comment: The eigenvalues are properties of the model fit, and so only depends on the genomic data and the climate across the reference period. In other words, they do not vary across the different GCMs (ie different future climatic data). That’s why we generate only three barplots, one for each set of SNPs.
Code
par(mfrow=c(1,3))lapply(snp_sets, function(snp_set){ggap <-genetic.gap(input = geno,env = clim_ref,pred.env = list_clim_pred[[1]], # can be any element of the list, it does not mattercandidate.loci =which(names(geno) %in% snp_set$set_snps),K =5)barplot(ggap$eigenvalues, col ="orange", xlab ="Axes", ylab ="Eigenvalues",main=snp_set$set_name)})
Correlations between the squared Euclidean climatic distance and the genetic gap:
# Find minimum and maximum values of genomic offset for the mapsgo_limits <-lapply(snp_sets, function(x) {lapply(names(list_clim_pred), function(gcm){x$go[[gcm]]}) %>%unlist()}) %>%unlist() %>%range()# The minimum GO value is very very small, almost zero, so we fix it to zero.go_limits[[1]] <-0# Generate the maps for each set of SNPs and each GCMlapply(snp_sets, function(x) {go_maps <-lapply(names(list_clim_pred), function(gcm){make_go_map(dfcoord=pop_coord,snp_set = x,gcm=gcm,ggtitle=gcm,go_limits = go_limits,point_size =3)})legend_maps <-get_legend(go_maps[[1]])go_maps <-lapply(go_maps, function(y) y +theme(legend.position ="none"))go_maps$legend_maps <- legend_mapsgo_maps <-plot_grid(plotlist=go_maps)title <-ggdraw() +draw_label( x$set_name,fontface ='bold',x =0,hjust =0 ) +theme(plot.margin =margin(0, 0, 0, 7))# merge title and plotsplot_grid( title, go_maps,ncol =1,rel_heights =c(0.1, 1)) })
6 Validation - NFI plots
Code
# Load the climatic data of the NFI plots.nfi_clim <-readRDS(here("data/ClimaticData/NFIplots/NFIclimate.rds"))# Keep only the climatic variables of interest and scale the climatic datanfi_dfs <-generate_scaled_clim_datasets(clim_var, clim_ref = nfi_clim$clim_ref, clim_pred = nfi_clim$clim_survey)# Calculate the genomic offset for the NFI plotssnp_sets <-lapply(snp_sets, function(snp_set){ggap <-genetic.gap(input = geno,env = clim_ref,new.env = nfi_dfs$clim_ref[,-1],pred.env = nfi_dfs$clim_pred[,-1],candidate.loci =which(names(geno) %in% snp_set$set_snps),K =5)snp_set$go_nfi <- ggap$offsetreturn(snp_set)})# checking missing data# lapply(snp_sets, function(x) sum(is.na(x$go_nfi)))# Find minimum and maximum values of genomic offset for the mapsgo_limits <-lapply(snp_sets, function(snp_set) snp_set$go_nfi) %>%unlist() %>%range()# The minimum GO value is very very small, almost zero, so we fix it to zero.go_limits[[1]] <-0# map genomic offset predictions in the NFI plots lapply(snp_sets, function(x) make_go_map(dfcoord=readRDS(here("data/ClimaticData/NFIplots/NFIclimate.rds"))[[1]] %>% dplyr::select(contains("ude")), snp_set = x,type="NFI",point_size =0.5,go_limits = go_limits,legend_position =c(0.85,0.15),legend_box_background ="gray80",y_limits =c(35, 51)))
7 Validation - Common gardens
Code
# Common garden information# =========================cg_clim <-readRDS(here("data/ClimaticData/CommonGardens/ClimateCG.rds")) %>% dplyr::select(cg,any_of(clim_var))cg_coord <-readRDS(here("data/ClimaticData/CommonGardens/ClimateCG.rds")) %>% dplyr::select(cg,contains("ude"))cg_names <-unique(cg_coord$cg)# Climatic datasets for the predictions# =====================================# In these datasets, one row per prediction# We want GO predictions for each combination pop * CG, so 34 * 5 = 160 rows# climatic data of the populations# each row (ie each population climatic data) is repeated 5 timesclim_pop <-generate_scaled_clim_datasets(clim_var, clim_pred = cg_clim)[[1]] %>%slice(rep(1:n(), each =nrow(cg_clim)))# Climatic data in the common gardens scaled with the mean and variance of the ref-period climatic data# each climatic dataset is repeated 34 times and then rows are combinedclim_cg <-generate_scaled_clim_datasets(clim_var, clim_pred = cg_clim)[[2]] %>%replicate(nrow(pop_coord),.,simplify=F) %>%bind_rows()# Predict genomic offset of each population when transplanted in the climate of the common gardenssnp_sets <-lapply(snp_sets, function(snp_set){ggap <-genetic.gap(input = geno,env = clim_ref,new.env = clim_pop[,-1],pred.env = clim_cg[,-1],candidate.loci =which(names(geno) %in% snp_set$set_snps),K =5)snp_set$go_cg <-bind_cols(pop=clim_pop[,1],cg=clim_cg[,1]) %>%mutate(go=ggap$offset) %>%pivot_wider(names_from = cg, values_from = go)return(snp_set)})# Mapping# ========go_maps_cg <-lapply(cg_names, function(cg_name){# Find minimum and maximum values of genomic offset for the mapsgo_limits <-lapply(snp_sets, function(snp_set) snp_set$go_cg[[cg_name]]) %>%unlist() %>%range()go_limits[[1]] <-0p <-lapply(snp_sets, function(x) make_go_map(dfcoord=pop_coord, snp_set = x,point_size =3,type="CG",go_limits = go_limits,cg_name=cg_name,cg_coord=cg_coord))plot_grid(p[[1]],p[[2]],p[[3]],nrow=1)}) %>%setNames(cg_names)pdf(here("figs/LEA/GOmaps_CGs.pdf"), width=15,height=6)lapply(go_maps_cg, function(x) x)dev.off()# show mapslapply(go_maps_cg, function(x) x)
Let’s save the genomic offset predictions for comparison with the other methods.
In a similar way to Gain et al. (2023) and Capblancq et al. (2023), we want to use allele frequencies corrected for population structure in the Gradient Forest analysis. Here we use the outputs from a latent factor mixed model (LFMM) to extract the corrected allele frequencies.
To understand, here the model applied by the lfmm2 function:
\[\mathbf{Y}_{fit} = \mathbf{XB}^T + \mathbf{UV}^T\] where \(\mathbf{B}\), \(\mathbf{U}\) and \(\mathbf{V}\) are the effect size, and factor and loading matrices adjusted by the lfmm2 algorithm from the set of current environmental variables included in the matrix \(\mathbf{X}\). \(\mathbf{B}\) is a matrix of dimension \(p \times b\), with \(p\) the number of genetic markers and \(b\) the number of environmental variables. \(\mathbf{U}\) is a matrix of dimension \(n \times K\), with \(n\) the number of individuals (ie genotypes) and \(K\) the number of latent factors. \(\mathbf{V}\) is a matrix of dimension \(p \times K\). \(\mathbf{X}\) is a matrix of dimension \(n \times b\). \(\mathbf{Y}_{fit}\) is a matrix of dimension \(n \times p\).
We want a matrix of allele frequencies corrected for population structure, so basically:
\[\mathbf{Y}_{corrected} = \mathbf{Y}_{fit} - \mathbf{UV}^T = \mathbf{XB}^T\] Below, we do the matrix multiplication of the matrix \(\mathbf{X}\) (dimension \(n \times b\)) and the transpose of the matrix \(\mathbf{B}\) (dimension \(b \times p\)) to obtain the matrix \(\mathbf{Y}_{corrected}\) (dimension \(n \times p\)).
Code
# we load the imputed genomic data with allele counts (and with MAF)geno_withmaf <-read.csv(here("data/DryadRepo/ImputedGenomicData_AlleleCounts_withmaf.csv")) %>%column_to_rownames("snp_ID") %>%t()# We run LFMM mod_withmaf <-lfmm2(input = geno_withmaf,env =as.matrix(clim_ref),K =5,effect.sizes = T) # to return the matrix B (the matrix of effect sizes)# Matrix multiplication of matrix X and the transpose of matrix Bgeno_corrected <-as.matrix(clim_ref) %*%t(mod_withmaf@B) %>%set_rownames(rownames(geno_withmaf)) %>%as.data.frame() %>%rownames_to_column(var="clon")# save the matrix for future analyses with and without MAFgeno_corrected %>%write_csv(here("data/DryadRepo/CorrectedAlleleFrequencies_withmaf.csv"))geno_corrected %>% dplyr::select(clon,all_of(colnames(geno))) %>%write_csv(here("data/DryadRepo/CorrectedAlleleFrequencies_withoutmaf.csv"))geno_corrected[1:10,1:10] %>%kable_mydf(boldfirstcolumn =1)
clon
snp_1
snp_2
snp_3
snp_4
snp_5
snp_6
snp_7
snp_8
snp_9
ALT10
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ALT2
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ALT3
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ALT4
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ALT5
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ALT6
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ALT8
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ALT9
0.01
0
0.12
-0.02
0.07
0.14
0.02
-0.01
-0.04
ARM1
0.01
0
0.14
-0.04
0.07
0.14
0.00
-0.01
-0.06
ARM10
0.01
0
0.14
-0.04
0.07
0.14
0.00
-0.01
-0.06
References
Capblancq, Thibaut, Susanne Lachmuth, Matthew C Fitzpatrick, and Stephen R Keller. 2023. “From Common Gardens to Candidate Genes: Exploring Local Adaptation to Climate in Red Spruce.”New Phytologist 237 (5): 1590–1605.
Caye, Kevin, Basile Jumentier, Johanna Lepeule, and Olivier François. 2019. “LFMM 2: Fast and Accurate Inference of Gene-Environment Associations in Genome-Wide Studies.”Molecular Biology and Evolution 36 (4): 852–60.
Frichot, Eric, Sean D Schoville, Guillaume Bouchard, and Olivier François. 2013. “Testing for Associations Between Loci and Environmental Gradients Using Latent Factor Mixed Models.”Molecular Biology and Evolution 30 (7): 1687–99.
Gain, Clément, and Olivier François. 2021. “LEA 3: Factor Models in Population Genetics and Ecological Genomics with R.”Molecular Ecology Resources 21 (8): 2738–48.
Gain, Clément, Benedicte Rhone, Philippe Cubry, Israfel Salazar, Florence Forbes, Yves Vigouroux, Flora Jay, and Olivier François. 2023. “A Quantitative Theory for Genomic Offset Statistics.”bioRxiv, 2023–01.
Jaramillo-Correa, Juan-Pablo, Isabel Rodrı́guez-Quilón, Delphine Grivet, Camille Lepoittevin, Federico Sebastiani, Myriam Heuertz, Pauline H Garnier-Géré, et al. 2015. “Molecular Proxies for Climate Maladaptation in a Long-Lived Tree (Pinus Pinaster Aiton, Pinaceae).”Genetics 199 (3): 793–807.
Pritchard, Jonathan K, Matthew Stephens, and Peter Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.”Genetics 155 (2): 945–59.